All outputs will be saved to out/221202_A01366_0326_AHHTTWDMXY/hg38/SHE5052A9_S101/all-well/DGE_filtered/unintegrated/.
Load Parse Biosciences split-pipe pipeline output as a Seurat object with no cut-offs, remove any missing samples, add sample metadata, add summary statistics, and optionally subset samples. Then filter nuclei per gene.
Gene-level filtering is also performed at this point, based on the
min_nuclei_per_gene argument. Genes that are present in
very few nuclei of the dataset (e.g., <3) are uninformative and
unlikely to help in differentiating between groups of nuclei. In
general, most genes removed by this filtering step will be those not
detected in any nuclei, which will help to trim the size of the
object.
The minimum number of nuclei per gene is set to 5.
# seu <- load_parse_to_seurat(
# parse_dir,
# experiment,
# genome,
# sublibrary,
# parse_analysis_subdir,
# min_nFeature_RNA = 0,
# min_nuclei_per_gene = min_nuclei_per_gene,
# sample_subset,
# remove_na_samples = T,
# do_add_sample_metadata = T,
# do_add_summary_stats = T,
# groupings
# )
# # save seu
# saveRDS(seu, file = here::here(out$base, "seu.rds"))
seu <- readRDS(here::here(out$base, "seu.rds"))
seu <- seu[sample(nrow(seu), nrow(seu)*0.5),sample(ncol(seu), ncol(seu)*0.1)]
Nuclei per sample
Nucleus-level filtering is performed, applying cut-offs to
nCount_RNA, nFeature_RNA,
percent_mito, and doublet values of each
nucleus.
Putative doublets are identified using the scDblFinder
package. This package creates artificial doublets by grouping together
random pairs of nuclei within each sample to create pseudo-doublets, and
then identifying nuclei that cluster near to these pseudo-doublets
during dimensionality reduction. This co-clustering indicates that they
carry a doublet signature.
The remove_doublets argument is set to
FALSE.
A dummy column will be introduced, treating all nuclei as though they are singlets.
seu$doublet <- 0
Proportions of transcript types of interest (mitochondrial, ribosomal, globin) are annotated. Mitochondrial genes are particularly of interest for filtering, but it is good practice to inspect the others as well.
percent_mito - A high percentage of mitochondrial genes
indicates death, loss of cytoplasmic RNA, or increased apoptosis.percent_ribo - mRNA that code for ribosomal proteins.
They do not point to specific issues, but it can be good to have a look
at their relative abundance. They can have biological relevance.percent_globin - Globin genes are very abundant in
erythrocytes. Depending on your application, you can expect
‘contamination’ of erythrocytes and select against it.seu <- annotate_proportions_of_transcript_types(seu)
The do_filtering argument is set to
TRUE. Filters will be applied.
seu <- get_filters(
seu,
do_filtering,
remove_doublets,
min_nCount_RNA,
max_nCount_RNA,
min_nFeature_RNA,
max_nFeature_RNA,
max_percent_mito
)
seu <- subset(seu, cells = unique(dplyr::filter(seu@misc$nucleus_filtering, pass)$nucleus))
# save seu_filtered
saveRDS(seu, here::here(out$base, "seu_filtered.rds"))
100% (168/168) of nuclei retained.
Find the most highly variable genes (HVGs) across samples.
seu <- Seurat::FindVariableFeatures(seu)
Seurat::SCTransform() is a function that performs normalisation and scaling of the data.
Variables to regress out of the SCTransform residuals (argument
vars_to_regress) are percent_mito and
nCount_RNA. These variables are prevented from
contributing much to the PCA during dimensionality reduction, and thus
confounding the analysis due to irrelevant variation.
seu <- Seurat::SCTransform(seu, vars.to.regress = vars_to_regress)
## Calculating cell attributes from input UMI matrix: log_umi
## Variance stabilizing transformation of count matrix of size 6889 by 168
## Model formula is y ~ log_umi
## Get Negative Binomial regression parameters per gene
## Using 2000 genes, 168 cells
## Found 37 outliers - those will be ignored in fitting/regularization step
## Second step: Get residuals using fitted parameters for 6889 genes
## Computing corrected count matrix for 6889 genes
## Calculating gene attributes
## Wall clock passed: Time difference of 4.365217 secs
## Determine variable features
## Place corrected count matrix in counts slot
## Regressing out percent_mito, nCount_RNA
## Centering data matrix
## Set default assay to SCT
# save seu_transformed
saveRDS(seu, file = here::here(out$base, "seu_transformed.rds"))
Nuclei are assigned a score representing the likelihood that they are in each phase of the cell cycle. This is based on the expression levels of genes associated with each phase. The highest probability phase of each nucleus is stored in a new “Phase” column of the metadata.
seu <- Seurat::CellCycleScoring(
seu,
s.features = Seurat::cc.genes$s.genes,
g2m.features = Seurat::cc.genes$g2m.genes
)
## Warning: The following features are not present in the object: PCNA, FEN1,
## MCM2, RRM1, GINS2, CDCA7, MLF1IP, RPA2, NASP, GMNN, CCNE2, UBR7, POLD3, RAD51,
## RRM2, CDC45, CDC6, EXO1, TIPIN, DSCC1, CASP8AP2, USP1, CLSPN, POLA1, CHAF1B,
## E2F8, not searching for symbol synonyms
## Warning: The following features are not present in the object: HMGB2, CDK1,
## NUSAP1, UBE2C, BIRC5, TPX2, TOP2A, NDC80, CKS2, CKS1B, MKI67, TMPO, FAM64A,
## SMC4, CCNB2, CKAP2, AURKB, KIF11, ANP32E, GTSE1, KIF20B, CDCA3, HN1, CDC20,
## TTK, CDC25C, NCAPD2, DLGAP5, CDCA2, HMMR, PSRC1, LBR, CKAP5, NEK2, G2E3, CBX5,
## not searching for symbol synonyms
Dimensionality reduction is performed to isolated the strongest signals and to remove background noise.
seu <- Seurat::RunPCA(seu)
## PC_ 1
## Positive: FKBP5, RORA, PDK4, ADAMTS9-AS2, PRUNE2, ACSL1, RNLS, CYP3A5, SULT1C2, BCL2
## BNIP3L, RNF152, ERRFI1, GPX3, ANK3, MYOCOS, RETREG1, AGBL4, ACSS3, PDE7A
## PKHD1, LINC02532, REPS2, LINC00621, SPAG17, SLC9A9, ENSG00000225689, SPSB1, PRKN, FBXO25
## Negative: DCBLD2, MACF1, CDK6, PRKCA, CHST11, IGFN1, LAMB1, MAP1B, RUNX2, KTN1
## SRGAP1, ZSWIM6, CASK, FRMD5, NRP1, WWC1, RGS20, AGPAT4, TRIO, MIR4435-2HG
## ITGA3, DGKH, APP, NEK6, PCED1B, ME1, CARMIL1, ZMIZ1, KIF26B, HSPG2
## PC_ 2
## Positive: ZNF395, SMIM2-AS1, TRABD2B, GPI, PGGHG, CDH4, PITPNC1, WWC1, COL23A1, AGAP1
## PRKD1, NEK6, HIF1A-AS3, LAMA5, TTYH3, ITGA3, ANPEP, VAV2, FBXO17, ASTN2
## RHPN2, EML6, MAST4, ANGPTL4, COL27A1, EVA1A, NRP1, KCNK3, MAML2, RFX2
## Negative: FKBP5, CHST11, IRS2, ABTB2, PPM1L, PCED1B, ACSL1, SQSTM1, EFNA5, ESYT2
## CYP3A5, ADAMTS9-AS2, DOCK4, FBXO25, PDK4, SORT1, ZSWIM6, TLR2, GFPT1, UTRN
## IGFN1, GBE1, CACNA2D1, PDE8A, NAP1L1, PHIP, AGPAT4, ENSG00000291147, DCBLD2, CARD11
## PC_ 3
## Positive: TCF4, LDB2, SHANK3, MGAT4A, EBF1, PECAM1, BTNL9, IQSEC1, PTPRB, PBX1
## GNA14, PRKCH, ENSG00000284686, NOTCH4, MEF2C, ADGRL4, ADGRF5, PLPP1, ARHGEF3, ENSG00000289873
## RAPGEF5, FMNL3, SPRY1, RASGRF2, ENSG00000286508, SH3BP5, ENSG00000248636, KALRN, FLT4, HECW2
## Negative: EFNA5, SQSTM1, TRIO, WWC1, DCBLD2, IGFN1, PRKCA, NHS, AKAP12, HAVCR1
## RUNX2, GALNT14, ABCC4, PLEKHA5, ITGA3, SRGAP1, TUSC3, LINC01322, NAP1L1, FRMD5
## KIF21A, UGCG, NDRG1, CARMIL1, CD70, ME1, ABCC2, PTPRJ, CDR2, TPM1
## PC_ 4
## Positive: MAT2A, ANXA2, SNHG1, ADM, TGM2, ENO1, CDCA8, RUNX1, YWHAQ, HCFC1
## CENPE, SPSB1, ABTB2, KIF14, DTL, SNAPC4, CRY1, HSP90AA1, RFWD3, KMT2D
## LMNB1, KLF10, PPP2CA, GAK, CCT2, EEF2, ATP13A3, TCOF1, ARRDC3, NAMPT
## Negative: LINC00472, PLEKHA5, TMEM178B, ENSG00000237813, DERA, SSH2, EPB41L4A, FLG-AS1, MBD5, FRMD5
## RGS20, MYLK, HMGA1P4, VWA8, PDGFC, FGD6, LINC00278, TIMD4, EXT1, CRPPA
## MAPK10, ENSG00000241962, ARHGAP29, LINC02646, RPS6KA2, LINC00342, SLC4A4, DGKH, LAMA3, XKR6
## PC_ 5
## Positive: TGM2, GBE1, PPARD, PHLDB2, FMNL2, HRH1, ITGA3, EFNA5, ANGPTL4, CA12
## ATP13A3, PXYLP1, NDRG1, PTPRE, SPSB1, SDK1, RASA3, ANO6, ABTB2, PTPN14
## PLEKHA1, GLIS3, SGMS2, LINC01320, ITPR3, MAML2, NCK2, LIF, NAV2, SEMA6A
## Negative: ACSM2B, CDHR5, ATP1B1, ENPP7P8, OSBPL11, ENSG00000206129, WDR72, MMADHC-DT, EBNA1BP2, BBOX1
## CEP120, GRB2, PNPLA7, MYO7B, DPYS, FBXL3, PLEKHF2, GPX3, PRPF40A, POLR2A
## SLC12A6, GATM, CLIC4, ACACB, MGAT1, SLC13A1, IRF1, APOL6, LINC02532, TMEM63A
## Warning: Requested number is larger than the number of available items (168).
## Setting to 168.
## Warning: Requested number is larger than the number of available items (168).
## Setting to 168.
The dimensionality of the dataset to use for downstream analysis must be decided. This is the number of principal components believed to capture the majority of true biological signal in the dataset. Using too many PCs usually does not affect the result too much, while including too few PCs can be detrimental, as meaningful biological variation may be lost. Therefore, it is better to be overgenerous when defining the dimensionality of the data than to underestimate it.
The dimensionality can be decided by consulting the elbow plot (see
below) and manually picking a value (set by argument
n_dims) or it can be calculated using the
intrinsicDimensionality package.
No value was provided for n_dims, so dimensionality will be calculated using the intrinsicDimensions package.
The dimensionality of the dataset can also be estimated using the
intrinsicDimension package. This bypasses the manual approach of reading
the elbow plot, but can sometimes be a bit too conservative. Therefore,
if n_dims is not defined by the user, the dimensionality is
set as double the intrinsicDimension estimate.
n_dims_iD <- intrinsicDimension::maxLikGlobalDimEst(
seu@reductions$pca@cell.embeddings,
k = 10
)$dim.est %>% round(0)
generous_n_dims <- round(2 * n_dims_iD, 0)
final_n_dims <- generous_n_dims
Dimensionality set to 24.
A plot of all available groupings of the data, projected onto the three dimensionality reductions (PCA, UMAP, and tSNE).
From Clarke et al., 2021
Annotating cell states and gradients When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states. The definitions of cell type and state are not yet standardized, but a stable cell type may be expected to have homogeneous gene expression across a cluster and be compact in a 2D projection plot, whereas cell gradients appear as a spread-out string of cells and cell states (e.g., cell cycle state). Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation63, spatial patterning64 or transient developmental stages. Care must be taken to distinguish biologically meaningful cell states and experimental batch effects, which can manifest in a similar way.
We perform unsupervised transcriptome similarity-based clustering to identify groups of nuclei with relatively homogeneous transcript profiles (united by a shared celltype / state). Clustering is performed at different resolutions to explore how groupings change across the parameter space. The aim is to find a resolution at which the clusters appear to best reflect true biological subpopulations of the dataset.
This process is based on the assumption that true discrete clusters of nuclei do exist within the dataset. The preceding nucleus filtering, feature selection, and dimensionality reduction steps were taken to distill only the highest quality features that reflect the underlying structure of the population and to remove noise that distracts from this structure.
As with the final dimensionality parameter, there is no deterministic way to set the final clustering resolution and no definitive best value. Instead, one must qualitatively assess clustering outputs for the given dataset and pick a ‘good’ resolution. Typically, a ‘good’ final resolution is taken as one that groups nuclei into celltypes. However, looking at multiple clustering resolutions of the same dataset can provide multiple true and valuable conclusions about the dataset at multiple levels of functional subspecialisation of cells.
Setting a low clustering resolution will produce fewer clusters, reflect the broadest subdivisions between groups of nuclei (e.g. different tissues / differentiation lineages), and is more robust to noise, but can miss important local structure within the dataset.
Higher clustering resolutions will result in more, increasingly granular clusters and can reveal celltype-level groups. In can also reveal new / rare celltypes that are hidden at lower resolutions. Setting the clustering resolution too high, however, can produce meaningless groups and lead to overfitting.
The first step is to compute the k.param nearest
neighbours for a given dataset. This step constructs a shared nearest
neighbours (SNN) graph from euclidean distances between nuclei in PCA
space, and then computes edge weights between each pair of nuclei based
on their Jaccard similarity (the shared overlap of their local
neighborhoods).
seu <- Seurat::FindNeighbors(seu, dims = 1:final_n_dims)
## Computing nearest neighbor graph
## Computing SNN
Then, groups of nuclei are iteratively grouped together. Clusters are
identified by a SNN modularity optimisation-based clustering algorithm.
This procedure is repeated for all clustering resolutions to be tested
(set by the clustering_resolutions argument).
The clustering resolutions to be tested are 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8.
seu <- Seurat::FindClusters(seu, resolution = clustering_resolutions, verbose = F)
A final clustering resolution can be decided by looking for sensible groupings of nuclei projected onto the dimensionality reduction plots.
Alternatively, the clustree package provides a more
intuitive way to visualise clustering behaviour at different resolutions
and to devide on a final clustering resolution. It constructs clustering
trees that show how samples move and split up as the number of clusters
increases. A ‘solid’ clustering resolution is taken as one in a part of
the tree where clusters remain relatively stable and constant for a few
resolution. This indicates that the groups are robust and strongly
supported by the data, rather than volatile and noisy.
The final clustering resolution has been set to 0.3.
seu$cluster <- seu@meta.data[, paste0(snn_res_prefixes, final_clustering_resolution)]
# save seu_clustered
saveRDS(seu, here::here(out$base, "seu_clustered.rds"))
The conventional approach to celltype annotation for sc/snRNAseq datasets is to inspect known marker genes from each of the celltypes that are expected and likely to be found at the sample site. Known relationships between marker genes and celltypes of interest can be obtained from databases or manually curated from literature. Then, the expression profiles of all markers from all celltypes across all clusters is programmatically or visually inspected and clusters are then labelled according to their marker profile.
The package SingleR returns “the best annotation for
each cell in a test dataset, given a labelled reference dataset in the
same feature space”. Using the human primary cell atlas from
celldex, we can find the nearest reference celltype to each
nucleus in the dataset. This approach is good because it is unbiased and
does not require manual assignment, but it is unclear how well snRNAseq
output will map onto a scRNAseq atlas. Additionally, the reference
dataset from celldex is more rigid, cannot be customised to
the use case, and does not contain transformed celltypes.
human_primary_ref <- celldex::HumanPrimaryCellAtlasData()
## snapshotDate(): 2022-10-31
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
seu_singler <- SingleR::SingleR(
test = Seurat::GetAssayData(seu, slot = "data"),
ref = human_primary_ref,
labels = human_primary_ref$label.main
)
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(data$density): no non-missing arguments to max; returning -Inf
## Warning: Computation failed in `stat_ydensity()`
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0
Some annotations will contain very few nuclei. These are usually not of interest and clutter up plotting, so we remove annotations assigned to < 10 nuclei.
seu$singler_annot <- seu_singler %>%
dplyr::as_tibble() %>%
dplyr::group_by(labels) %>%
dplyr::transmute(singler_annot = replace(labels, dplyr::n() < 10, "none")) %>%
dplyr::pull(singler_annot)
Literature markers for celltypes of interest in ccRCC tumours